80 research outputs found

    FastBLAST: Homology Relationships for Millions of Proteins

    Get PDF
    BackgroundAll-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding.Methodology/principal findingsWe present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database ("NR"), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query.Conclusions/significanceFastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast

    PhyloPat: phylogenetic pattern analysis of eukaryotic genes

    Get PDF
    BACKGROUND: Phylogenetic patterns show the presence or absence of certain genes or proteins in a set of species. They can also be used to determine sets of genes or proteins that occur only in certain evolutionary branches. Phylogenetic patterns analysis has routinely been applied to protein databases such as COG and OrthoMCL, but not upon gene databases. Here we present a tool named PhyloPat which allows the complete Ensembl gene database to be queried using phylogenetic patterns. DESCRIPTION: PhyloPat is an easy-to-use webserver, which can be used to query the orthologies of all complete genomes within the EnsMart database using phylogenetic patterns. This enables the determination of sets of genes that occur only in certain evolutionary branches or even single species. We found in total 446,825 genes and 3,164,088 orthologous relationships within the EnsMart v40 database. We used a single linkage clustering algorithm to create 147,922 phylogenetic lineages, using every one of the orthologies provided by Ensembl. PhyloPat provides the possibility of querying with either binary phylogenetic patterns (created by checkboxes) or regular expressions. Specific branches of a phylogenetic tree of the 21 included species can be selected to create a branch-specific phylogenetic pattern. Users can also input a list of Ensembl or EMBL IDs to check which phylogenetic lineage any gene belongs to. The output can be saved in HTML, Excel or plain text format for further analysis. A link to the FatiGO web interface has been incorporated in the HTML output, creating easy access to functional information. Finally, lists of omnipresent, polypresent and oligopresent genes have been included. CONCLUSION: PhyloPat is the first tool to combine complete genome information with phylogenetic pattern querying. Since we used the orthologies generated by the accurate pipeline of Ensembl, the obtained phylogenetic lineages are reliable. The completeness and reliability of these phylogenetic lineages will further increase with the addition of newly found orthologous relationships within each new Ensembl release

    Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data

    Get PDF
    International audienceBACKGROUND: The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or the most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility, and compare the performance of several clustering and classification algorithms to identify the species of 364 sequences of 16S rRNA gene with a defined species in GenBank, and 110 sequences of 16S rRNA gene with no defined species, all within the genus Nocardia. METHODS: A total of 364 16S rRNA gene sequences of Nocardia species were studied. In addition, 110 16S rRNA gene sequences assigned only to the Nocardia genus level at the time of submission to GenBank were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm or the linear mapping (LM) of the distance matrix. Principal Components Analysis was used for the dimensionality reduction and visualization. RESULTS: The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species have been identified as 'centroids' in respective clusters from which the distances to all other sequences were minimized; 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578. CONCLUSION: The identification of centroids of 16S rRNA gene sequence clusters using novel distance matrix clustering enables the identification of the most representative sequences for each individual species of Nocardia and allows the quantitation of inter- and intra-species variability

    Ultra-fast sequence clustering from similarity networks with SiLiX

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The number of gene sequences that are available for comparative genomics approaches is increasing extremely quickly. A current challenge is to be able to handle this huge amount of sequences in order to build families of homologous sequences in a reasonable time.</p> <p>Results</p> <p>We present the software package <monospace>SiLiX</monospace> that implements a novel method which reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 millions sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity.</p> <p>Conclusions</p> <p>Comparing state-of-the-art software, <monospace>SiLiX</monospace> presents the best up-to-date capabilities to face the problem of clustering large collections of sequences. <monospace>SiLiX</monospace> is freely available at <url>http://lbbe.univ-lyon1.fr/SiLiX</url>.</p

    Genetic variants of CYP3A5, CYP2D6, SULT1A1, UGT2B15 and tamoxifen response in postmenopausal patients with breast cancer

    Get PDF
    INTRODUCTION: Tamoxifen therapy reduces the risk of recurrence and prolongs the survival of oestrogen-receptor-positive patients with breast cancer. Even if most patients benefit from tamoxifen, many breast tumours either fail to respond or become resistant. Because tamoxifen is extensively metabolised by polymorphic enzymes, one proposed mechanism underlying the resistance is altered metabolism. In the present study we investigated the prognostic and/or predictive value of functional polymorphisms in cytochrome P450 3A5 CYP3A5 (*3), CYP2D6 (*4), sulphotransferase 1A1 (SULT1A1; *2) and UDP-glucuronosyltransferase 2B15 (UGT2B15; *2) in tamoxifen-treated patients with breast cancer. METHODS: In all, 677 tamoxifen-treated postmenopausal patients with breast cancer, of whom 238 were randomised to either 2 or 5 years of tamoxifen, were genotyped by using PCR with restriction fragment length polymorphism or PCR with denaturing high-performance liquid chromatography. RESULTS: The prognostic evaluation performed in the total population revealed a significantly better disease-free survival in patients homozygous for CYP2D6*4. For CYP3A5, SULT1A1 and UGT2B15 no prognostic significance was observed. In the randomised group we found that for CYP3A5, homozygous carriers of the *3 allele tended to have an increased risk of recurrence when treated for 2 years with tamoxifen, although this was not statistically significant (hazard ratio (HR) = 2.84, 95% confidence interval (CI) = 0.68 to 11.99, P = 0.15). In the group randomised to 5 years' tamoxifen the survival pattern shifted towards a significantly improved recurrence-free survival (RFS) among CYP3A5*3-homozygous patients (HR = 0.20, 95% CI = 0.07 to 0.55, P = 0.002). No reliable differences could be seen between treatment duration and the genotypes of CYP2D6, SULT1A1 or UGT2B15. The significantly improved RFS with prolonged tamoxifen treatment in CYP3A5*3 homozygotes was also seen in a multivariate Cox model (HR = 0.13, CI = 0.02 to 0.86, P = 0.03), whereas no differences could be seen for CYP2D6, SULT1A1 and UGT2B15. CONCLUSION: The metabolism of tamoxifen is complex and the mechanisms responsible for the resistance are unlikely to be explained by a single polymorphism; instead it is a combination of several mechanisms. However, the present data suggest that genetic variation in CYP3A5 may predict response to tamoxifen therapy

    Shotgun sequencing of Yersinia enterocolitica strain W22703 (biotype 2, serotype O:9): genomic evidence for oscillation between invertebrates and mammals

    Get PDF
    <p>Abstract</p> <p>Background</p> <p><it>Yersinia enterocolitica </it>strains responsible for mild gastroenteritis in humans are very diverse with respect to their metabolic and virulence properties. Strain W22703 (biotype 2, serotype O:9) was recently identified to possess nematocidal and insecticidal activity. To better understand the relationship between pathogenicity towards insects and humans, we compared the W22703 genome with that of the highly pathogenic strain 8081 (biotype1B; serotype O:8), the only <it>Y. enterocolitica </it>strain sequenced so far.</p> <p>Results</p> <p>We used whole-genome shotgun data to assemble, annotate and analyse the sequence of strain W22703. Numerous factors assumed to contribute to enteric survival and pathogenesis, among them osmoregulated periplasmic glucan, hydrogenases, cobalamin-dependent pathways, iron uptake systems and the <it>Yersinia </it>genome island 1 (YGI-1) involved in tight adherence were identified to be common to the 8081 and W22703 genomes. However, sets of ~550 genes revealed to be specific for each of them in comparison to the other strain. The plasticity zone (PZ) of 142 kb in the W22703 genome carries an ancient flagellar cluster Flg-2 of ~40 kb, but it lacks the pathogenicity island YAPI<sub>Ye</sub>, the secretion system <it>ysa </it>and <it>yts1</it>, and other virulence determinants of the 8081 PZ. Its composition underlines the prominent variability of this genome region and demonstrates its contribution to the higher pathogenicity of biotype 1B strains with respect to W22703. A novel type three secretion system of mosaic structure was found in the genome of W22703 that is absent in the sequenced strains of the human pathogenic <it>Yersinia </it>species, but conserved in the genomes of the apathogenic species. We identified several regions of differences in W22703 that mainly code for transporters, regulators, metabolic pathways, and defence factors.</p> <p>Conclusion</p> <p>The W22703 sequence analysis revealed a genome composition distinct from other pathogenic <it>Yersinia enterocolitica </it>strains, thus contributing novel data to the <it>Y. enterocolitica </it>pan-genome. This study also sheds further light on the strategies of this pathogen to cope with its environments.</p

    Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

    Get PDF
    Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes

    Glutamine versus Ammonia Utilization in the NAD Synthetase Family

    Get PDF
    NAD is a ubiquitous and essential metabolic redox cofactor which also functions as a substrate in certain regulatory pathways. The last step of NAD synthesis is the ATP-dependent amidation of deamido-NAD by NAD synthetase (NADS). Members of the NADS family are present in nearly all species across the three kingdoms of Life. In eukaryotic NADS, the core synthetase domain is fused with a nitrilase-like glutaminase domain supplying ammonia for the reaction. This two-domain NADS arrangement enabling the utilization of glutamine as nitrogen donor is also present in various bacterial lineages. However, many other bacterial members of NADS family do not contain a glutaminase domain, and they can utilize only ammonia (but not glutamine) in vitro. A single-domain NADS is also characteristic for nearly all Archaea, and its dependence on ammonia was demonstrated here for the representative enzyme from Methanocaldococcus jannaschi. However, a question about the actual in vivo nitrogen donor for single-domain members of the NADS family remained open: Is it glutamine hydrolyzed by a committed (but yet unknown) glutaminase subunit, as in most ATP-dependent amidotransferases, or free ammonia as in glutamine synthetase? Here we addressed this dilemma by combining evolutionary analysis of the NADS family with experimental characterization of two representative bacterial systems: a two-subunit NADS from Thermus thermophilus and a single-domain NADS from Salmonella typhimurium providing evidence that ammonia (and not glutamine) is the physiological substrate of a typical single-domain NADS. The latter represents the most likely ancestral form of NADS. The ability to utilize glutamine appears to have evolved via recruitment of a glutaminase subunit followed by domain fusion in an early branch of Bacteria. Further evolution of the NADS family included lineage-specific loss of one of the two alternative forms and horizontal gene transfer events. Lastly, we identified NADS structural elements associated with glutamine-utilizing capabilities
    • …
    corecore